Microclustering: When the Cluster Sizes Grow Sublinearly with the Size of the Data Set

Authors

  • Jeffrey W. Miller
  • Brenda Betancourt
  • Abbas Zaidi
  • Hanna Wallach
  • Rebecca C. Steorts
Abstract

Most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman–Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some tasks, this assumption is undesirable. For example, when performing entity resolution, the size of each cluster is often unrelated to the size of the data set. Consequently, each cluster contains a negligible fraction of the total number of data points. Such tasks therefore require models that yield clusters whose sizes grow sublinearly with the size of the data set. We address this requirement by defining the microclustering property and introducing a new model that exhibits this property. We compare this model to several commonly used clustering models by checking model fit using real and simulated data sets.
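
The linear-growth behavior described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' model or code; it samples cluster sizes from a Chinese restaurant process (the partition underlying a Dirichlet process mixture) and reports the largest cluster's share of the data. Treating that share as a proxy for "cluster sizes grow sublinearly" is an assumption made here for illustration, and the function names `crp_cluster_sizes` and `largest_cluster_fraction`, the concentration value `alpha=1.0`, and the seeds are all hypothetical choices.

```python
import random


def crp_cluster_sizes(n, alpha, rng):
    """Sample cluster sizes for n points from a Chinese restaurant process."""
    sizes = []
    for i in range(n):
        # With i points already seated, the new point opens a cluster with
        # probability alpha / (i + alpha) and joins existing cluster k with
        # probability sizes[k] / (i + alpha).
        u = rng.random() * (i + alpha)
        if u < alpha:
            sizes.append(1)
            continue
        u -= alpha
        for k, s in enumerate(sizes):
            if u < s:
                sizes[k] += 1
                break
            u -= s
        else:
            # Guard against floating-point round-off at the upper boundary.
            sizes[-1] += 1
    return sizes


def largest_cluster_fraction(n, alpha=1.0, seed=0):
    """Return (size of largest cluster) / n for one simulated partition."""
    rng = random.Random(seed)
    sizes = crp_cluster_sizes(n, alpha, rng)
    return max(sizes) / n


if __name__ == "__main__":
    # Print the largest cluster's share of the data for growing n.
    for n in (100, 1000, 10000):
        print(n, round(largest_cluster_fraction(n), 3))
```

Under this kind of exchangeable model, the printed share would generally not be expected to shrink toward zero as n grows, whereas a model with the microclustering property would be expected to drive it toward zero.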

Similar resources

Flexible Models for Microclustering with Application to Entity Resolution

Most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman–Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some applications, this assumption is inappropriate....

Probabilistic Size-constrained Microclustering

Microclustering refers to clustering models that produce small clusters or, equivalently, to models where the size of the clusters grows sublinearly with the number of samples. We formulate probabilistic microclustering models by assigning a prior distribution on the size of the clusters, and in particular consider microclustering models with explicit bounds on the size of the clusters. The com...

Simulation of Fabrication toward High Quality Thin Films for Robotic Applications by Ionized Cluster Beam Deposition

The most commonly used method for the production of thin films is based on deposition of atoms or molecules onto a solid surface. One suitable method for producing high-quality metallic, semiconductor, and organic thin films is ionized cluster beam deposition (ICBD); such films are used in electronic, robotic, optical, and optoelectronic devices. Many important factors such as cluster size, cluster...

To Express Required CT-Scan Resolution for Porosity and Saturation Calculations in Terms of Average Grain Sizes

Despite advancements in specifying the 3D internal microstructure of reservoir rocks, identifying some sensitive phenomena is still problematic, particularly due to image resolution limitations. Discretization studies on such CT-scan data have always run into the problem that the original data do not fully describe the real porous media. As an attractive alternative approach, one can recon...

Impact of region of interest size and location in Gafchromic film dosimetry

Introduction: Accurate film dosimetry requires careful consideration of sources of uncertainty. Some of the sources of uncertainty are dependent on the size and location of region of interest (ROI), especially in small fields. Avoiding the penumbra is often a reason for using a small ROI. In contrast, choosing very small ROIs may increase uncertainty due to the reduction of th...

Publication date: 2015